# GraphGAP Reproducible Dataset (v3)

This dataset package consolidates the **data artifacts used in the paper** (tables, corpus register, scoring-output templates) to support **reproducibility**.

## Contents
- `tables/`: CSV exports of all tables embedded in the manuscript (`tables_index.csv` provides a map).
  - `requirement_level_summary_stats.csv`: Requirement-level statistics (n, GapScore mean + 95% bootstrap CI, Readiness P80 + CI, Share>=3).
  - `enforcement_corpus_register_n32.csv`: External enforcement/regulatory hard-corpus register (DocID, jurisdiction/regulator, type, year, title, source URL + retrieval keywords).
  - `ablation_design_grid.csv`: Ablation configuration grid (pages, tokenizer, n-gram, tau calibration, etc.).
  - Other tables: schema, rubrics, reliability summary, action-tier mapping, etc.
- `materials/`: Key source documents available in this workspace (paper PDFs, UNICEF-related PDF if provided).
- `code/reproduce_tfidf_hardcorpus.py`: Reference implementation (TF-IDF embedding + calibrated tau + ablation summary).
  - Note: To run it end-to-end, you need a **local text file** for each DocID (e.g., `local_docs/D1.txt` ...).
  - The paper provides **official source URLs + retrieval keywords**, so you can download the documents from official sites and convert them to text.
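
For orientation, the general idea behind "TF-IDF embedding + tau" can be sketched in plain Python. This is a minimal illustration, not the code in `code/reproduce_tfidf_hardcorpus.py`: the function names (`tokenize`, `tfidf_vectors`, `cosine`, `matches_above_tau`), the smoothed IDF formula, the unigram tokenizer, and the fixed `tau` passed in by the caller are all assumptions. The reference script may differ in tokenizer, n-gram range, weighting, and in how tau is calibrated.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase unigram tokenizer (assumption: the real script may use n-grams)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (term -> weight dict) per input text."""
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(tokenized)
    # Document frequency: number of texts that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumed variant, chosen so that no weight is zero).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    return [
        {term: freq * idf[term] for term, freq in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def matches_above_tau(query, docs, tau):
    """Return (doc_index, similarity) pairs whose similarity is at least tau."""
    vectors = tfidf_vectors(list(docs) + [query])
    query_vec = vectors[-1]
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(vectors[:-1])]
    return [(i, score) for i, score in scored if score >= tau]


if __name__ == "__main__":
    toy_docs = ["data breach notification fine", "annual weather report"]
    print(matches_above_tau("breach notification", toy_docs, tau=0.1))
```

Only the first toy document shares terms with the query, so it is the only one that can clear a positive threshold; the second scores exactly zero.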

## How to reproduce (high-level)
1. Use `tables/enforcement_corpus_register_n32.csv` to retrieve each official document via:
   - `Source URL (official) + retrieval keywords`.
2. Convert each PDF to text (e.g., `pdftotext`) and save as `local_docs/<DocID>.txt`.
3. Run:
   ```bash
   python code/reproduce_tfidf_hardcorpus.py \
     --register tables/enforcement_corpus_register_n32.csv \
     --queries tables/requirement_queries.csv \
     --local_docs local_docs \
     --outdir outputs
   ```
4. Inspect:
   - `outputs/ablation_summary.csv`
   - `outputs/missing_local_docs.csv`
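
Before step 3, it can help to check which register entries still lack a local text file. The sketch below is a pre-flight check only, not a copy of the script's own logic. It assumes the register has a column headed exactly `DocID` and that files follow the `local_docs/<DocID>.txt` pattern from step 2; the helper name `find_missing_docs` is hypothetical.

```python
import csv
from pathlib import Path


def find_missing_docs(register_csv, local_docs_dir, id_column="DocID"):
    """Return the DocIDs in the register that have no <DocID>.txt file.

    Assumption: the register column holding the identifier is named `DocID`.
    """
    docs_dir = Path(local_docs_dir)
    missing = []
    # utf-8-sig tolerates a byte-order mark, which spreadsheet exports often add.
    with open(register_csv, newline="", encoding="utf-8-sig") as fh:
        for row in csv.DictReader(fh):
            doc_id = (row.get(id_column) or "").strip()
            if doc_id and not (docs_dir / f"{doc_id}.txt").is_file():
                missing.append(doc_id)
    return missing


if __name__ == "__main__":
    register = Path("tables/enforcement_corpus_register_n32.csv")
    if register.is_file():
        for doc_id in find_missing_docs(register, "local_docs"):
            print(f"missing: local_docs/{doc_id}.txt")
```

If the register uses a different header for the identifier, pass it through `id_column`.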

## Provenance
- Built from manuscript: `eb498d8d-02e1-4bda-971a-518575392617.docx`
- Build time: 2025-12-19T17:30:10.524666Z

